Generating Images From Spoken Descriptions
Authors
Abstract
Text-based technologies, such as text translation from one language to another and image captioning, are gaining popularity. However, an estimated half of the world's languages lack a commonly used written form; consequently, these languages cannot benefit from text-based technologies. This paper presents a new speech technology task, i.e., speech-to-image generation (S2IG), and a framework which 1) translates spoken descriptions to photo-realistic images 2) without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed framework, referred to as S2IGAN, consists of a speech embedding network and a relation-supervised densely-stacked generative model. The speech embedding network learns speech embeddings with the supervision of the corresponding visual information, i.e., the images. The generative model synthesizes images, conditioned on the speech embeddings produced by the embedding network, that are semantically consistent with the spoken descriptions. Extensive experiments were conducted on four public benchmark databases: two databases commonly used in text-to-image tasks, CUB-200 and Oxford-102, for which we created synthesized spoken descriptions, and two databases with natural spoken descriptions often used in the field of cross-modal learning, Flickr8k and Places. Results demonstrate the effectiveness of S2IGAN in synthesizing high-quality, semantically-consistent images from the speech signal, yielding good performance and a solid baseline for the S2IG task.
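The abstract describes a two-stage pipeline: a speech embedding network (SEN) that maps a spoken description to a fixed-size embedding, and a densely-stacked generative model (RDG) that synthesizes an image conditioned on that embedding. The sketch below is a hypothetical, toy-scale illustration of that data flow only; the stage names, pooling, and "upsample and condition" logic are simplified stand-ins and not the actual S2IGAN implementation.

```python
import random

EMBED_DIM = 8  # assumed toy embedding size, not the paper's

def speech_embedding(frames):
    """SEN stand-in: mean-pool acoustic frames into one fixed-size embedding."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(EMBED_DIM)]

def stacked_generator(embedding, noise, stages=3):
    """RDG stand-in: each stage doubles the 'resolution' and mixes in the
    speech embedding as a conditioning signal."""
    image = noise
    for _ in range(stages):
        image = [v for v in image for _ in (0, 1)]  # upsample by duplication
        image = [v + 0.1 * embedding[i % EMBED_DIM]  # inject conditioning
                 for i, v in enumerate(image)]
    return image

random.seed(0)
frames = [[random.random() for _ in range(EMBED_DIM)] for _ in range(20)]
emb = speech_embedding(frames)            # one vector per utterance
img = stacked_generator(emb, noise=[0.0] * 4)
print(len(emb), len(img))                 # → 8 32  (4 * 2**3 "pixels")
```

In the actual model, both stages are deep networks trained adversarially, and the relation supervision ties the embedding space to the corresponding images; here only the stage-to-stage interface is shown.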
Similar resources
Generating Tailored, Comparative Descriptions in Spoken Dialogue
We describe an approach to presenting information in spoken dialogues that for the first time brings together multi-attribute decision models, strategic content planning, state-of-the-art dialogue management, and realization which incorporates prosodic features. The system selects the most important subset of available options to mention and the attributes that are most relevant to choosing bet...
Midge: Generating Descriptions of Images
We demonstrate a novel, robust vision-to-language generation system called Midge. Midge is a prototype system that connects computer vision to syntactic structures with semantic constraints, allowing for the automatic generation of detailed image descriptions. We explain how to connect vision detections to trees in Penn Treebank syntax, which provides the scaffolding necessary to further refine ...
Image2speech: Automatically generating audio descriptions of images
This paper proposes a new task for artificial intelligence. The image2speech task generates a spoken description of an image. We present baseline experiments in which the neural net used is a sequence-to-sequence model with attention, and the speech synthesizer is clustergen. Speech is generated from four different types of segmentations: two that require a language with known orthography (word...
Generating Descriptions of Spatial Relations between Objects in Images
We investigate the task of predicting prepositions that can be used to describe the spatial relationships between pairs of objects depicted in images. We explore the extent to which such spatial prepositions can be predicted from (a) language information, (b) visual information, and (c) combinations of the two. In this paper we describe the dataset of object pairs and prepositions we have creat...
Generating Auction Configurations from Declarative Contract Descriptions
This work presents an approach to automating the negotiation of business contracts and describes an implementation of a subset of this overall goal. To support automated contract negotiation, we are developing a language for both (1.) fully-specified, executable contracts and (2.) partially-specified contracts that are in the midst of being negotiated, specifically via automated auctions. The langu...
Journal
Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Year: 2021
ISSN: ['2329-9304', '2329-9290']
DOI: https://doi.org/10.1109/taslp.2021.3053391